
    Fun, Not Competition: The Story of My Math Club

    For almost three years, I have spent most of my Sunday afternoons doing math with my daughters and a group of their school friends. Below I detail why and how the math club is run. Unlike my day job, which is full of (statistical) learning objectives for my college students, my math club has only the objective that the kids I work with learn to associate mathematics with having fun. My math club has its challenges, but the motivation comes from a love of mathematics, which makes it fun and worth every minute.

    A method for generating realistic correlation matrices

    Simulating sample correlation matrices is important in many areas of statistics. Common approaches, such as generating Gaussian data and computing their sample correlation matrix, or generating random uniform [-1, 1] deviates as pairwise correlations, both have drawbacks. We develop an algorithm for adding noise, in a highly controlled manner, to general correlation matrices. In many instances, our method yields results which are superior to those obtained by simply simulating Gaussian data. Moreover, we demonstrate how our general algorithm can be tailored to a number of different correlation models. Using our results with a few different applications, we show that simulating correlation matrices can help assess statistical methodology. Comment: Published at http://dx.doi.org/10.1214/13-AOAS638 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
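
    As a rough illustration of the baseline approach the abstract contrasts against (simulating Gaussian data and taking its sample correlation matrix), here is a minimal Python sketch. The function name and the 3x3 exchangeable example are assumptions for illustration; this is not the authors' controlled-noise algorithm.

    ```python
    import numpy as np

    def sample_corr_from_gaussian(true_corr, n_samples, seed=None):
        """Baseline mentioned in the abstract: draw multivariate Gaussian data
        with a given population correlation matrix and return the resulting
        sample correlation matrix."""
        rng = np.random.default_rng(seed)
        p = true_corr.shape[0]
        data = rng.multivariate_normal(np.zeros(p), true_corr, size=n_samples)
        return np.corrcoef(data, rowvar=False)

    # Hypothetical example: a 3x3 exchangeable correlation structure
    true_corr = np.array([[1.0, 0.5, 0.5],
                          [0.5, 1.0, 0.5],
                          [0.5, 0.5, 1.0]])
    print(sample_corr_from_gaussian(true_corr, n_samples=200, seed=42))
    ```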

    Network Analysis with the Enron Email Corpus

    We use the Enron email corpus to study relationships in a network by applying six different measures of centrality. Our results came out of an in-semester undergraduate research seminar. The Enron corpus is well suited to statistical analyses at all levels of undergraduate education. Through this note's focus on centrality, students can explore the dependence of statistical models on initial assumptions and the interplay between centrality measures and hierarchical ranking, and they can use completed studies as springboards for future research. The Enron corpus also presents opportunities for research into many other areas of analysis, including social networks, clustering, and natural language processing. Comment: in Journal of Statistics Education, Volume 23, Number 2, 2015
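
    The abstract does not name the six centrality measures it uses, so the sketch below simply shows, with networkx and a made-up toy email graph, the kind of centrality computations involved; the graph, node names, and choice of measures are illustrative assumptions, not the paper's analysis.

    ```python
    import networkx as nx

    # Toy directed "email" graph (sender -> recipient) standing in for the
    # Enron corpus; the actual corpus is far larger.
    G = nx.DiGraph([("alice", "bob"), ("bob", "carol"),
                    ("carol", "alice"), ("alice", "carol"),
                    ("dave", "alice")])

    print(nx.degree_centrality(G))       # normalized in- plus out-degree
    print(nx.betweenness_centrality(G))  # fraction of shortest paths through each node
    print(nx.pagerank(G))                # eigenvector-style importance score
    ```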

    Microarray Data from a Statistician’s Point of View


    Changes Across 25 Years of Statistics in Medicine

    This piece is a series of interviews with giants in the field of medicine on their views of how statistics is changing medicine. I interviewed the editor of the New England Journal of Medicine, a preeminent doctor and researcher of lung cancer, the director of the LA County Department of Public Health, and a Harvard statistician who sits on the editorial board of the New England Journal of Medicine.

    Prediction Error Estimation in Random Forests

    In this paper, error estimates of classification Random Forests are quantitatively assessed. Building on the theoretical framework of Bates et al. (2023), the true error rate and expected error rate are theoretically and empirically investigated in the context of a variety of error estimation methods common to Random Forests. We show that in the classification case, Random Forests' estimate of prediction error is, on average, closer to the true error rate than to the average prediction error. This is the opposite of the findings of Bates et al. (2023), which were given for logistic regression. We further show that this result holds across different error estimation strategies such as cross-validation, bagging, and data splitting. Comment: arXiv admin note: text overlap with arXiv:2104.00673 by other authors
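
    A minimal sketch of the three error estimation strategies the abstract names (cross-validation, out-of-bag error from bagging, and data splitting), using scikit-learn on synthetic data; the data set, model settings, and comparison are assumptions for illustration and do not reproduce the paper's experiments.

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Cross-validation estimate of the error rate
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    cv_error = 1 - cross_val_score(rf, X, y, cv=5).mean()

    # Out-of-bag (bagging-based) estimate
    rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True,
                                    random_state=0).fit(X, y)
    oob_error = 1 - rf_oob.oob_score_

    # Simple data-splitting (hold-out) estimate
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    holdout = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    split_error = 1 - holdout.score(X_te, y_te)

    print(f"CV: {cv_error:.3f}  OOB: {oob_error:.3f}  split: {split_error:.3f}")
    ```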

    Differential expression analysis for multiple conditions

    As high-throughput sequencing has become common practice, the cost of sequencing large amounts of genetic data has been drastically reduced, leading to much larger data sets for analysis. One important task is to identify biological conditions that lead to unusually high or low expression of a particular gene. Packages such as DESeq implement a simple method for testing differential signal when exactly two biological conditions are possible. For more than two conditions, pairwise testing is typically used. Here the DESeq method is extended so that three or more biological conditions can be assessed simultaneously. Because the computation time grows exponentially in the number of conditions, a Monte Carlo approach provides a fast way to approximate the p-values for the new test. The approach is studied on both simulated data and a data set of *C. jejuni*, the bacterium responsible for most food poisoning in the United States.
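
    The extended test and its statistic are not given in the abstract; the sketch below shows only the generic Monte Carlo device it alludes to, approximating a p-value by the fraction of null-simulated statistics at least as extreme as the observed one. The function and the chi-square stand-in for the null distribution are hypothetical, not the paper's method.

    ```python
    import numpy as np

    def monte_carlo_pvalue(observed_stat, simulate_null_stat, n_sim=10_000, seed=None):
        """Generic Monte Carlo p-value: the fraction of statistics simulated under
        the null that are at least as extreme as the observed statistic, with the
        +1 correction so the estimate is never exactly zero."""
        rng = np.random.default_rng(seed)
        null_stats = np.array([simulate_null_stat(rng) for _ in range(n_sim)])
        return (1 + np.sum(null_stats >= observed_stat)) / (n_sim + 1)

    # Hypothetical example: a test statistic whose null distribution we can only
    # sample from (here faked with a chi-square draw), not evaluate exactly.
    p_hat = monte_carlo_pvalue(observed_stat=11.3,
                               simulate_null_stat=lambda rng: rng.chisquare(df=4),
                               n_sim=5000, seed=1)
    print(f"approximate p-value: {p_hat:.4f}")
    ```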

    Integrating computing in the statistics and data science curriculum: Creative structures, novel skills and habits, and ways to teach computational thinking

    Nolan and Temple Lang (2010) argued for the fundamental role of computing in the statistics curriculum. In the intervening decade the statistics education community has acknowledged that computational skills are as important to statistics and data science practice as mathematics. There remains a notable gap, however, between our intentions and our actions. In this special issue of the *Journal of Statistics and Data Science Education* we have assembled a collection of papers that (1) suggest creative structures to integrate computing, (2) describe novel data science skills and habits, and (3) propose ways to teach computational thinking. We believe that it is critical for the community to redouble our efforts to embrace sophisticated computing in the statistics and data science curriculum. We hope that these papers provide useful guidance for the community to move these efforts forward. Comment: In press, Journal of Statistics and Data Science Education